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Abstract 


Optimal estimation problems for non-linear non-Gaussian state-space models do not typically admit analytic 
solutions. Since their introduction in 1993, particle filtering methods have become a very popular class of 
algorithms to solve these estimation problems numerically in an online manner, i.e. recursively as obser¬ 
vations become available, and are now routinely used in fields as diverse as computer vision, econometrics, 
robotics and navigation. The objective of this tutorial is to provide a complete, up-to-date survey of this 
field as of 2008. Basic and advanced particle methods for filtering as well as smoothing are presented. 

Keywords: Central Limit Theorem, Filtering, Hidden Markov Models, Markov chain Monte Carlo, Particle 
methods, Resampling, Sequential Monte Carlo, Smoothing, State-Space models. 



1 Introduction 


The general state space hidden Markov models, which are summarised in section 2.1, provide an extremely 
flexible framework for modelling time series. The great descriptive power of these models comes at the 
expense of intractability: it is impossible to obtain analytic solutions to the inference problems of interest 
with the exception of a small number of particularly simple cases. The “particle” methods described by this 
tutorial are a broad and popular class of Monte Carlo algorithms which have been developed over the past 
fifteen years to provide approximate solutions to these intractable inference problems. 


1.1 Preliminary remarks 

Since their introduction in 1993 [22], particle filters have become a very popular class of numerical methods 
for the solution of optimal estimation problems in non-linear non-Gaussian scenarios. In comparison with 
standard approximation methods, such as the popular Extended Kalman Filter, the principal advantage 
of particle methods is that they do not rely on any local linearisation technique or any crude functional 
approximation. The price that must be paid for this flexibility is computational: these methods are com¬ 
putationally expensive. However, thanks to the availability of ever-increasing computational power, these 
methods are already used in real-time applications appearing in fields as diverse as chemical engineering, 
computer vision, financial econometrics, target tracking and robotics. Moreover, even in scenarios in which 
there are no real-time constraints, these methods can be a powerful alternative to Markov chain Monte Carlo 
(MCMC) algorithms — alternatively, they can be used to design very efficient MCMC schemes. 

As a result of the popularity of particle methods, a few tutorials have already been published on the subject 
[3, 8, 18, 29]. The most popular, [3], dates back to 2002 and, like the edited volume [16] from 2001, it is 
now somewhat outdated. This tutorial differs from previously published tutorials in two ways. First, the 
obvious: it is, as of December 2008, the most recent tutorial on the subject and so it has been possible to 
include some very recent material on advanced particle methods for filtering and smoothing. Second, more 
importantly, this tutorial was not intended to resemble a cookbook. To this end, all of the algorithms are 
presented within a simple, unified framework. In particular, we show that essentially all basic and advanced 
methods for particle filtering can be reinterpreted as some special instances of a single generic Sequential 
Monte Carlo (SMC) algorithm. In our opinion, this framework is not only elegant but allows the development 
of a better intuitive and theoretical understanding of particle methods. It also shows that essentially any 
particle filter can be implemented using a simple computational framework such as that provided by [24], 
Absolute beginners might benefit from reading [17], which provides an elementary introduction to the field, 
before the present tutorial. 


1.2 Organisation of the tutorial 

The rest of this paper is organised as follows. In Section 2, we present hidden Markov models and the 
associated Bayesian recursions for the filtering and smoothing distributions. In Section 3, we introduce a 
generic SMC algorithm which provides weighted samples from any sequence of probability distributions. In 
Section 4, we show how all the (basic and advanced) particle filtering methods developed in the literature 
can be interpreted as special instances of the generic SMC algorithm presented in Section 3. Section 5 is 
devoted to particle smoothing and we mention some open problems in Section 6. 



2 Bayesian Inference in Hidden Markov Models 


2.1 Hidden Markov Models and Inference Aims 

Consider an A”—valued discrete-time Markov process {X n } n>1 such that 

X\~p{x\) and X n \ (X n _i = x n -t) ~ / (x n \ x n _i) (1) 

where means distributed according to, p(x) is a probability density function and f (x\x') denotes the 
probability density associated with moving from x’ to x. All the densities are with respect to a dominating 
measure that we will denote, with abuse of notation, dx. We are interested in estimating {X n } n>1 but only 
have access to the y ~valued process {Y n } n>1 . We assume that, given {X n } n>1 , the observations {Y n } n>1 
are statistically independent and their marginal densities (with respect to a dominating measure dy n ) are 
given by 

Y n \{X n =x n )~g{y n \x n ). (2) 

For the sake of simplicity, we have considered only the homogeneous case here; that is, the transition and 
observation densities are independent of the time index n. The extension to the inhomogeneous case is 
straightforward. It is assumed throughout that any model parameters are known. 

Models compatible with (l)-(2) are known as hidden Markov models (HMM) or general state-space models 
(SSM). This class includes many models of interest. The following examples provide an illustration of several 
simple problems which can be dealt with within this framework. More complicated scenarios can also be 
considered. 


Example 1 - Finite State-Space HMM. In this case, we have X = {1,.... K\ so 

Pr (Xi = k)=p(k). Pr(X n = k\ X n _i = /) = / (k\ l). 

The observations follow an arbitrary model of the form (2). This type of model is extremely general and 
examples can be found in areas such as genetics in which they can describe imperfectly observed genetic 
sequences, signal processing, and computer science in which they can describe, amongst many other things, 
arbitrary finite-state machines. 

Example 2 - Linear Gaussian model. Here, X = M nx , y = R"», X 1 ~ J\f ( 0, E) and 

X n = AT n _i + BV n , 

Y n = CX n + DW n 

where V n 1 ~ 1 ' TV (0, W n ’ A?' TV (0, /„, u .) and A,B,C,D are matrices of appropriate dimensions. Note 
that Af (m, E) denotes a Gaussian distribution of mean m and variance-covariance matrix E, whereas 
Af( x ; m, E) denotes the Gaussian density of argument x and similar statistics. In this case p (x) = N (a:; 0, E), 
/ ( x'\x) = J\f (x'; Ax, BB t ) and g(y\x) = J\f (y; Cx, DD T ). As inference is analytically tractable for this 
model, it has been extremely widely used for problems such as target tracking and signal processing. 

Example 3 - Switching Linear Gaussian model. We have X = U x Z with U = (1,..., K} and Z = R ,lz . 
Here X n = ( U n ,Z n ) where {U n } is a finite state-space Markov chain such that Pr(17i = k) = pu (k), 
Pr ( U n = k\ U n -\ =1) = fu (k\l) and conditional upon {[/„} we have a linear Gaussian model with Zi\ U\ ~ 
Af(0, EyJ and 


Z n = Ajj n Z n _\ + Bjj n V n , 
Y n = C Un Z n + D Uri W n 





Figure 1: A simulation of the stochastic volatility model described in example 4. 


where V n Jf(0,I„ v ), W n 1 ~ d ' Af (0, I nvj ) and { A k ,B k ,Ck,D k ;h = 1, K} are matrices of appropriate 
dimensions. In this case we have y{x) = p(u,z) = flu (u)Af (z; 0, E„), f(x'\x) = f ((u' , z')\ (u, z)) = 
fu {u'\ u)Af (z'; A u >z, B U >B J) and g(yjx) = g (y\ ( u , z)) = Af ( y ; C u z, D U D J). This type of model provides 
a generalisation of that described in example 2 with only a slight increase in complexity. 

Example 4 - Stochastic Volatility model. We have X = y = R, X\ ~ Af ^0, x ^ q2 j and 

X n = aVi + <rV n , 

Y n = /3 exp (X n /2) W n 

where V n Af (0, 1) and W„ A/"(0,1). In this case we have fi(x) = Af(x;0, , f(x'\x) = 

Af (a/; ax, a 2 ) and g(y\x) = Af (y. 0, 3 2 exp (a;)). Note that this choice of initial distribution ensures that 
the marginal distribution of X n is also y (x) for all n. This type of model, and its generalisations, have been 
very widely used in various areas of economics and mathematical finance: inferring and predicting underlying 
volatility from observed price or rate data is an important problem. Figure 1 shows a short section of data 
simulated from such a model with parameter values a = 0.91, a = 1.0 and 3 = 0.5 which will be used below 
to illustrate the behaviour of some simple algorithms. 


Equations (l)-(2) define a Bayesian model in which (1) defines the prior distribution of the process of interest 
{X n } n>1 and (2) defines the likelihood function; that is: 


P (xi:n) = P (*l) n f(x k \x k -l) 
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(3) 









and 

P(yi:n\xi-. n ) = J[g(yk\Xk) , (4) 

where, for any sequence {z n } n>1 , and any i < j, Zi-.j := {zi % & §|i,... ,Zj ). 


In such a Bayesian context, inference about X 1:n given a realisation of the observations Y 1:n = y 1:n relies 
upon the posterior distribution 


P(xi:n\yi:n) 


P{Xl:n,yi:n) 

p(yi-.n) 


(5) 


where 


P(xi:n,yi:n) = P (^l:n) P ( Vl:n I ®l:n) , (6) 

and p (yi-.n) = Jp(xi: n ,yv.n)dx 1:n . (7) 

For the finite state-space HMM model discussed in Example 1, the integrals correspond to finite sums 
and all these (discrete) probability distributions can be computed exactly. For the linear Gaussian model 
discussed in Example 2, it is easy to check that p{x\ :n \y\ :n ) is a Gaussian distribution whose mean and 
covariance can be computed using Kalman techniques; see [1], for example. However, for most non-linear 
non-Gaussian models, it is not possible to compute these distributions in closed-form and we need to employ 
numerical methods. Particle methods are a set of flexible and powerful simulation-based methods which 
provide samples approximately distributed according to posterior distributions of the form p(x\ :n \ yi :n ) and 
facilitate the approximate calculation of p{yi-. n )- Such methods are a subset of the class of methods known 
as Sequential Monte Carlo (SMC) methods. 


In this tutorial, we will review various particle methods to address the following problems: 

• Filtering and Marginal likelihood computation: Assume that we are interested in the sequential approx¬ 
imation of the distributions {p (aJi :n | 2/i:n)}„>i and marginal likelihoods {p {yi-.n)} n >\- That is, we wish 
to approximate p(x i| y\ ) and p{yi) at the first time instance, p(a;i : 2| 1/1:2) and p (1/1,2) at the second time 
instance and so on. We will refer to this problem as the optimal filtering problem. This is slightly at vari¬ 
ance with the usage in much of the literature which reserves the term for the estimation of the marginal 
distributions {p{x n \ 2/i:n)}„>i rather than the joint distributions {p(x \ :n \ yi-. n )} n >\- 

We will describe basic and advanced particle filtering methods to address this problem including auxiliary 
particle filtering, particle filtering with MCMC moves, block sampling strategies and Rao-Blackwellised 
particle filters. 

• Smoothing: Consider attempting to sample from a joint distribution p (ii :T | y 1:T ) and approximating the 
associated marginals {p {x n \ yi-.r)} where n = 1,...,T. Particle filtering techniques can be used to solve 
this problem but perform poorly when T is large for reasons detailed in this tutorial. We will describe 
several particle smoothing methods to address this problem. Essentially, these methods rely on the particle 
implementation of the forward filtering-backward smoothing formula or of a generalised version of the two- 
filter smoothing formula. 


2.2 Filtering and Marginal Likelihood 

The first area of interest, and that to which the vast majority of the literature on particle methods has 
been dedicated from the outset, is the problem of filtering: characterising the distribution of the state of 
the hidden Markov model at the present time, given the information provided by all of the observations 
received up to the present time. This can be thought of as a “tracking” problem: keeping track of the 
current “location” of the system given noisy observations — and, indeed, this is an extremely popular area 
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of application for these methods. The term is sometimes also used to refer to the practice of estimating the 
full trajectory of the state sequence up to the present time given the observations received up to this time. 


We recall that, following (l)-(2), the posterior distribution p{x\. n \y\ :n ) is defined by (5) — the prior is 
defined in (3) and the likelihood in (4). The unnormalised posterior distribution p{x\. n ,y\ :n ) given in (5) 
satisfies 

P Ol :n, Vl:n) = P (*l:n-l , Vl-.n-l) f (x n \ Z„_i) g (y n \x n ) . (8) 

Consequently, the posterior p{x\ :n \ y\ :n ) satisfies the following recursion 


P(Xl:n\yi:n) = P ( &fcn-l I Vl-.n-l) 


f( x „\ x n ~i)g (y n \x n ) 
p(y n \yi-.n-i) 


(9) 


where 

P(Vn\Vl-.n-l) = JP («„_i| /(*„| z n _i) g(y n \ x n ) dx n - 1:n (10) 

In the literature, the recursion satisfied by the marginal distribution p(x n \yi :n ) is often presented. It is 
straightforward to check (by integrating out x\ :n -\ in (9)) that we have 


p{x n \yi-.n) 


g {Vn\x n )p{x n \ Vl-.n-l) 

P(Vn\Vl-.n-l) 


( 11 ) 


where 

P {x n \ Vi-.n-i) = J f(x n \x n \)p(x n i\y im t)dx n i. (12) 

Equation (12) is known as the prediction step and (11) is known as the updating step. However, most 
particle filtering methods rely on a numerical approximation of recursion (9) and not of (11)-(12). 

If we can compute {p (xi :n \ yi-. n )} and thus {p (x n \ yi-. n )} sequentially, then the quantity p(yi-. n ), which is 
known as the marginal likelihood, can also clearly be evaluated recursively using 


P {Vi-.nl = P (Vi) n p W y^-k-i) 


(13) 


where p{yk\ Vi-.k-i) is of the form (10). 


2.3 Smoothing 

One problem, which is closely related to filtering, but computationally more challenging for reasons which will 
become apparent later, is known as smoothing. Whereas filtering corresponds to estimating the distribution 
of the current state of an HMM based upon the observations received up until the current time, smoothing 
corresponds to estimating the distribution of the state at a particular time given all of the observations 
up to some later time. The trajectory estimates obtained by such methods, as a result of the additional 
information available, tend to be smoother than those obtained by filtering. It is intuitive that if estimates 
of the state at time n are not required instantly, then better estimation performance is likely to be obtained 
by taking advantage of a few later observations. Designing efficient sequential algorithms for the solution of 
this problem is not quite as straightforward as it might seem, but a number of effective strategies have been 
developed and are described below. 

More formally: assume that we have access to the data yi-.r, and wish to compute the marginal distributions 
{P i x n\ Vi-.t)} where n = 1, ...,T or to sample from p {x\-t\ Vi-.t)- In principle, the marginals {p {x n \ Vi-.t)} 
could be obtained directly by considering the joint distribution p ( x\-t \ Vi-.t) and integrating out the variables 
(xi :n -i,x n+ i ,t). Extending this reasoning in the context of particle methods, one can simply use the identity 
p{xu\vi:t) = / p{xi-.T\yi-.T)dxi;ft^idx n+ i:T and take the same approach which is used in particle filtering: 
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use Monte Carlo algorithms to obtain an approximate characterisation of the joint distribution and then 
use the associated marginal distribution to approximate the distributions of interest. Unfortunately, as is 
detailed below, when n <C T this strategy is doomed to failure: the marginal distribution p{x n \y-\ :n ) occupies 
a privileged role within the particle filter framework as it is, in some sense, better characterised than any of 
the other marginal distributions. 

For this reason, it is necessary to develop more sophisticated strategies in order to obtain good smoothing 
algorithms. There has been much progress in this direction over the past decade. Below, we present two 
alternative recursions that will prove useful when numerical approximations are required. The key to the 
success of these recursions is that they rely upon only the marginal filtering distributions {p (x n y-\ :n )}- 


2.3.1 Forward-Backward Recursions 


The following decomposition of the joint distribution p(xi : f\ Vi-.t) 


1>{xv.t\vui) = p(xt\vi:t) n p (x n \ x n+1 ,y 1:T ) 

T—l 

= p(x T \yi: T ) Y[p{x n \x n+1 ,y 1:n ), 

71=1 

shows that, conditional on yi-.T, {X n } is an inhomogeneous Markov process. 


(14) 


Eq. (14) suggests the following algorithm to sample fromp {x\-t \ yi-.r)- First compute and store the marginal 
distributions {p(x n \ yi :n )} for n = 1, Then sample Xt ~ p(xt\ yi-.T ) and for n = T — 1, T — 2,..., 1, 
sample X n ~ p (x n \ X n+ \ ,yi :n ) where 


P{x n | 


></!:»} 


f (x„+i\x n )p(x n \y 1:n ) 
P{x n +l\yi:n) 


It also follows, by integrating out x n+ \-T) in Eq. (14), that 


p(x„\yi :T ) =p{x n \yi :n ) [ /( a "+il x ”) , 


(15) 


So to compute {p {x n \ yi-.r)}, we simply modify the backward pass and, instead of sampling fromp (x n \ x n+ i,yi 
we compute p{x n \ t/u) using (15). 


2.3.2 Generalised Two-Filter Formula 


The two-filter formula is a well-established alternative to the forward-filtering backward-smoothing technique 
to compute the marginal distributions {p(x„| Put)} [4]. It relies on the following identity 


P{x n \ Vi-.t) 


p(x n \y 1:n -i)p(y n , T \x n ) 

P(y n -.T\yi:n-l) 


where the so-called backward information filter is initialised at time n = T by p (pt| xt) = g (pt| xt) and 
satisfies 


p{y n -.T\x n ) 


f II 

J k=n +1 


Xk-i) Y[g(yk\x k )dx„ 


l+l :T 


(16) 







The backward information filter is not a probability density in argument x n and it is even possible that 
/ P {Vti-.t] x n ) dx n = oo. Although this is not an issue when p{y n -.T \x n ) can be computed exactly, it does 
preclude the direct use of SMC methods to estimate this integral. To address this problem, a generalised 
version of the two-filter formula was proposed in [5]. It relies on the introduction of a set of artificial 
probability distributions {p n (x n )} and the joint distributions 

T T 

Pn {Xn-.T | Vu-.t) OC.p n (x n ) / {x k \ X k -l) 9 ( Vk\X k ) , (17) 


which are constructed such that their marginal distributions, p n {x n \y n :T) oc p n {x n )p(y n: T\ x n ), are simply 
“integrable versions” of the backward information filter. It is easy to establish the generalised two-filter 


formula 


p(xi\yi :T ) 


y{x 1 )p(x 1 \y 1 .. T ) p(x n \yi:n-l)p{x n \y n -. T ) 

- =“7 -\-, P{Zn\yi-.T) OC - ~ ,- 7 - 

Pi {Xi) p n (x„) 


(18) 


which is valid whenever the support of p n (x n ) includes the support of the prior p n (x n )\ that is 


Pn (x n ) = / M (®l) n f (Xk\X k -l) dxi :n -i > 0 => p n (x n ) > 0. 

J fc=2 

The generalised two-filter smoother for {p(x n \y n: T)} proceeds as follows. Using the standard forward recur¬ 
sion, we can compute and store the marginal distributions {p{x n \ y\- n -i)}- Using a backward recursion, we 
compute and store {p(x n \y n: T)}- Then for any n = 1, ...,T we can combine p(x n \ yi :n -i) and p{x n \y n -T) to 
obtain p(x n \yi-T)- 

In [4], this identity is discussed in the particular case where p n ( x n ) = p n ( x n ). However, when computing 
{p{x n \yn:T)} using SMC, it is necessary to be able to compute p n (x n ) exactly hence this rules out the choice 
Pn (x n ) = Pn ( x n ) for most non-linear non-Gaussian models. In practice, we should select a heavy-tailed 
approximation of p n ( x n ) for p n (x n ) in such settings. It is also possible to use the generalised-two filter 
formula to sample from p(xi,t| 2/1 :t); see [5] for details. 


2.4 Summary 

Bayesian inference in non-linear non-Gaussian dynamic models relies on the sequence of posterior distri¬ 
butions {p(xi :n \yi-. n )} and its marginals. Except in simple problems such as Examples 1 and 2, it is not 
possible to compute these distributions in closed-form. In some scenarios, it might be possible to obtain 
reasonable performance by employing functional approximations of these distributions. Here, we will discuss 
only Monte Carlo approximations of these distributions; that is numerical schemes in which the distribu¬ 
tions of interest are approximated by a large collection of N random samples termed particles. The main 
advantage of such methods is that under weak assumptions they provide asymptotically (i.e. as N —>• oo) 
consistent estimates of the target distributions of interest. It is also noteworthy that these techniques can 
be applied to problems of moderately-high dimension in which traditional numerical integration might be 
expected to perform poorly. 


3 Sequential Monte Carlo Methods 


Over the past fifteen years, particle methods for filtering and smoothing have been the most common 
examples of SMC algorithms. Indeed, it has become traditional to present particle filtering and SMC as 
being the same thing in much of the literature. Here, we wish to emphasise that SMC actually encompasses a 
broader range of algorithms — and by doing so we are able to show that many more advanced techniques for 
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approximate filtering and smoothing can be described using precisely the same framework and terminology 
as the basic algorithm. 


SMC methods are a general class of Monte Carlo methods that sample sequentially from a sequence of target 
probability densities {7r„ (aq :n )} of increasing dimension where each distribution 7 r n (aq :n ) is defined on the 
product space X n . Writing 


TTn (*l:n) 


In (a?l:n) 

Z n 


(19) 


we require only that : X n —t R + is known pointwise; the normalising constant 


2 " = / 


In (*l:n) dx i: , 


( 20 ) 


might be unknown. SMC provide an approximation of 7Ti (aq) and an estimate of Z\ at time 1 then an 
approximation of 7T2 (aq^) and an estimate of Z^ at time 2 and so on. 


For example, in the context of filtering, we could have j n (aq.„) = p (aq :n , y i:n ), Z n = p (y 1:n ) so 7r„ (aq :n ) = 
p(%i:n\yi:n)- However, we emphasise that this is just one particular choice of target distributions. Not 
only can SMC methods be used outside the filtering context but, more importantly for this tutorial, some 
advanced particle filtering and smoothing methods discussed below do not rely on this sequence of target 
distributions. Consequently, we believe that understanding the main principles behind generic SMC methods 
is essential to the development of a proper understanding of particle filtering and smoothing methods. 


We start this section with a very basic review of Monte Carlo methods and Importance Sampling (IS). We 
then present the Sequential Importance Sampling (SIS) method, point out the limitations of this method 
and show how resampling techniques can be used to partially mitigate them. Having introduced the basic 
particle filter as an SMC method, we show how various advanced techniques which have been developed 
over the past fifteen years can themselves be interpreted within the same formalism as SMC algorithms 
associated with sequences of distributions which may not coincide with the filtering distributions. These 
alternative sequences of target distributions are either constructed such that they admit the distributions 
{P ( x i-.n\ Vi-.n)} as marginal distributions, or an importance sampling correction is necessary to ensure the 
consistency of estimates. 


3.1 Basics of Monte Carlo Methods 

Initially, consider approximating a generic probability density 7r n (aq :n ) for some fixed n. If we sample N 
independent random variables, X\. n ~ n n (aq :n ) for f = l,..., N. then the Monte Carlo method approximates 
7r n (*i:n) by the empirical measure 1 

Kn(xi:n) = JjYl (**:»') • 

where S Xu ( x ) denotes the Dirac delta mass located at xq. Based on this approximation, it is possible to 
approximate any marginal, say 7T„ (a;*), easily using 

T?n(Xk) = 6 (**)> 

and the expectation of any test function ip n : X n R given by 

In{<Pn) ■= J ‘Pn (*l:n) {x\:n) dx 1:n , 

x We persist with the abusive use of density notation in the interests of simplicity and accessibility; the alternations required 
to obtain a rigorous formulation are obvious. 



is estimated by 


In° fan) : = J Vn fa 1:n ) 7 T n (®l: n ) dx 1:n = -^ ^2 <Pn {X{. n ) . 

It is easy to check that this estimate is unbiased and that its variance is given by 

V [C° («$ = ^ (y <Pn (*l:a) (*l:n)^l:n “ 4 (¥>n)) • 

The main advantage of Monte Carlo methods over standard approximation techniques is that the variance of 
the approximation error decreases at a rate of 0(1/N) regardless of the dimension of the space X n . However, 
there are at least two main problems with this basic Monte Carlo approach: 

• Problem 1: If 7r n (x\ :n ) is a complex high-dimensional probability distribution, then we cannot sample 
from it. 

• Problem 2: Even if we knew how to sample exactly from 7r„ (x\ :n ), the computational complexity of such 
a sampling scheme is typically at least linear in the number of variables n. So an algorithm sampling exactly 
from 7r n (aq :n ), sequentially for each value of n, would have a computational complexity increasing at least 
linearly with n. 


3.2 Importance Sampling 

We are going to address Problem 1 using the IS method. This is a fundamental Monte Carlo method and 
the basis of all the algorithms developed later on. IS relies on the introduction of an importance density 2 
Qn {x\- n ) such that 

7T n (x\- n ) > 0 => q n (Xi :n ) > 0. 

In this case, we have from (19)-(20) the following IS identities 




(21) 


Z n = J w n ( x l:n) Qn { X 1 :n) 4*1 :n 

(22) 

where w n (x\- r 

») is the unnormalised weight function 



w {x ) _ln{xt :„) 

^ Qn( x l:n)' 


In particular, we can select an importance density q n (x\ :n ) from which it is easy to draw samples; e.g. 
a multivariate Gaussian. Assume we draw N independent samples X\. n ~ q n (x\- n ) then by inserting the 
Monte Carlo approximation of q n (x\ :n ) — that is the empirical measure of the samples X\ :n — into (21)-(22) 
we obtain 


N 

Kn(Xl:n) = J2 W n S Xl. n M , 

(23) 



(24) 

where 

W ‘ - Wd 

" Ef=, u 'n (XQ 

(25) 


2 Some authors use the terms proposal density or instrumental density interchangeably. 
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Compared to standard Monte Carlo, IS provides an (unbiased) estimate of the normalising constant with 
relative variance 


Vis [Z n \ 


m 


(Xl:„ 


{Xl:n. 


-dXi- n — 1 I 


(26) 


If we are interested in computing I n ( ip n ), we can also use the estimate 


In&r 


'■/ 


Pn (Xl , n 


n {Xl:n) dX\ A 


N 

{X\, n ) ■ 


Unlike I^f c (ip n ), this estimate is biased for finite N. However, it is consistent and it is easy to check that 
its asymptotic bias is given by 

lim N (4 s (<p n ) - I n (</>„)) = - [ * n j: Xl:n | (<p n (xi :n ) - 4 (<p n )) dx 1:n . 

W->oo J q n (Xl :n ) 

When the normalising constant is known analytically, we can calculate an unbiased importance sampling 
estimate — however, this generally has higher variance and this is not typically the case in the situations in 
which we are interested. 


Furthermore, d^(<p) satisfies a Central Limit Theorem (CLT) with asymptotic variance 

The bias being 0(1/N) and the variance 0(1/N), the mean-squared error given by the squared bias plus 
the variance is asymptotically dominated by the variance term. 

For a given test function, ip n (x-[ :n ). it is easy to establish the importance distribution minimising the asymp¬ 
totic variance of I^(ip n ). However, such a result is of minimal interest in a filtering context as this distribution 
depends on ip n (xi :n ) and we are typically interested in the expectations of several test functions. Moreover, 
even if we were interested in a single test function, say tp n ( x\ :n ) = x n , then selecting the optimal importance 
distribution at time n would have detrimental effects when we will try to obtain a sequential version of the 
algorithms (the optimal distribution for estimating <^„_i(a;i :n _i) will almost certainly not be — even similar 
to — the marginal distribution of in the optimal distribution for estimating y n (x\ :n ) and this will 

prove to be problematic). 

A more appropriate approach in this context is to attempt to select the q n (x\ :n ) which minimises the 
variance of the importance weights (or, equivalently, the variance of Z n ). Clearly, this variance is minimised 
for q n (xi :n ) = 7 t„ ( xi :n ). We cannot select q n ( x\. n ) = n n (xi :n ) as this is the reason we used IS in the first 
place. However, this simple result indicates that we should aim at selecting an IS distribution which is close 
as possible to the target. Also, although it is possible to construct samplers for which the variance is finite 
without satisfying this condition, it is advisable to select q n (xi :n ) so that w n {x\- n ) <C n < oo. 


3.3 Sequential Importance Sampling 

We are now going to present an algorithm that admits a fixed computational complexity at each time 
step in important scenarios and thus addresses Problem 2. This solution involves selecting an importance 
distribution which has the following structure 

Qn (Xtn) = Qn -1 (^l:n-l) Qn (®o| X\ n-l) 

=,i(*i)n^i—)• ( 2s ) 

fc=2 
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Practically, this means that to obtain particles X\. n ~ q n (x\- n ) at time n, we sample X\ ~ q\ (x-i ) at time 1 
then X' k ~ q k (xfc| X \. k _ s ) at time k for k = 2, The associated unnormalised weights can be computed 
recursively using the decomposition 


Wn (®1:„) = 


In (Zl :n) 
Qn (X 1:n ) 


ln-1 {X\:n-l) _ jn (h:J _ 

q n -1 (ari:n-i) 7n-i {xi-.n- 1) q n (x„l a;i:rt_*i) 


(29) 


which can be written in the form 


Wn {xi:n) = W n -1 ( X 1:n -1 ) ■ OL n (x 1:n ) 

= Wi ( Xi ) a k ( x 1:k ) 

k=2 


where the incremental importance weight function a n (xi :n ) is given by 


OL n (*l:n) 


_ 7n folm) _ 

7n-l (Xl:n-l) (x n | ®l:n-l) ’ 


(30) 


The SIS algorithm proceeds as follows, with each step carried out for i = 1,..., N: 


Sequential Importance Sampling 

At time n = 1 

• Sample X{ ~ gi(xi). 

• Compute the weights wi (Xf) and W{ oc wi (Xj). 

At time n> 2 

• Sample X l n ~ q n (x n \X\.„ ,). 

• Compute the weights 

W n {X\, n ) = W n _! (X^.,) • a„ (X’J , 
W* oc tu„ (Xj : „) . 


At any time, n, we obtain the estimates n n (x \- n ) (Eq. 23) and Z n (Eq. 24) of 7r n {x\- n ) and Z n , respectively. 
It is straightforward to check that a consistent estimate of Z n /Z n ~ i is also provided by the same set of 
samples: 



(*U- 


This estimator is motivated by the fact that 


I ot n (x 1:n )' 


(x 1:n - 


l) qn(Xn\xi m ^l)dx 1:n = j 


7n{Xv. n )Tr n -l{x 1: n-l)q n {x n \x 1: n-l) d 
ln-i{xv. n -i)q n {x n \x v . n -{) 


% _ 
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In this sequential framework, it would seem that the only freedom the user has at time n is the choice of 
Qn {x n \ xi :n _i) 3 . A sensible strategy consists of selecting it so as to minimise the variance of w n ( x \- n ). It is 
straightforward to check that this is achieved by selecting 

Qn Pt ( x n | ari :n _if = 7r n (x„ Xi ; „...i) 


as in this case the variance of 
is given by 


w n ( x i:n) conditional upon is zero and the associated incremental weight 


< Pt (*1 :n) 


7n {Xl,n-l) 
In 1 (Xl : „. l) 


fin (x 1:n ) dx n 
7n-l (ari:n-l) 


Note that it is not always possible to sample from n n (x n \X\- n -\). Nor is it always possible to compute 
a° pt (xi :n ). In these cases, one should employ an approximation of </° pt (x n \ aq :Tl _i) for q n (x n \ x 1:n _i). 


In those scenarios in which the time required to sample from q n (x n \ afi^i) and to compute a n ( Xi :n ) is 
independent of n (and this is, indeed, the case if q n is chosen sensibly and one is concerned with a problem 
such as filtering), it appears that we have provided a solution for Problem 2. However, it is important to 
be aware that the methodology presented here suffers from severe drawbacks. Even for standard IS, the 
variance of the resulting estimates increases exponentially with n (as is illustrated below; see also [28]). As 
SIS is nothing but a special version of IS in which we restrict ourselves to an importance distribution of the 
form (28) it suffers from the same problem. We demonstrate this using a very simple toy example. 


Example. Consider the case where X = R and 


n„{x 1:n ) <f= JJtt n (x k ) = A'(x fc ;0. 1), (31) 

ln(xi :„) = f[exp (-y) :i 

Z n = (2 t r) n/2 . 


We select an importance distribution 


Qn (xi-.„) = n % (xk) = II ( Xfc; °> a2 ) ■ 


In this case, 


have Vis < 


only for a 1 > \ and 



N 



It can easily be checked that 2 J 2 _i > -*■ f° r an y \ < cf 2 ^ V. the variance increases exponentially with n 
even in this simple case. For example, if we select a 2 = 1.2 then we have a reasonably good importance 
distribution as qk ( Xk ) « 7r n (#*,) but jv Vls Jf"^ r# (1.103)"^ 2 which is approximately equal to 1.9 x 10 21 for 

n = 1000! We would need to use AT w 2 x 10 23 particles to obtain a relative variance Vl ^f n ^ = 0.01. This is 
clearly impracticable. 

3 However, as we will see later, the key to many advanced SMC methods is the introduction of a sequence of target distributions 
which differ from the original target distributions. 
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3.4 Resampling 


We have seen that IS — and thus SIS — provides estimates whose variance increases, typically exponentially, 
with n. Resampling techniques are a key ingredient of SMC methods which (partially) solve this problem in 
some important scenarios. 

Resampling is a very intuitive idea which has major practical and theoretical benefits. Consider first an 
IS approximation n n ( x\- n ) of the target distribution n n ( x\ :n ). This approximation is based on weighted 
samples from q n (x\ :n ) and does not provide samples approximately distributed according to % (aq :Tl ). To 
obtain approximate samples from 7r„ ( x \ :n ), we can simply sample from its IS approximation 7r„ (aq :n ); that 
is we select X\. n with probability IT*. This operation is called resampling as it corresponds to sampling 
from an approximation n n ( x \- n ) which was itself obtained by sampling. If we are interested in obtaining 
N samples from 7r„ ( x \- n ), then we can simply resample N times from 7r n {x\ :n ). This is equivalent to 
associating a number of offspring N'^ with each particle X\. n in such a way that N^ :N = (Ab,..., N^) follow 
a multinomial distribution with parameter vector (iV, W^ N ) and associating a weight of 1/N with each 
offspring. We approximate n n (xi :n ) by the resampled empirical measure 

N ATi 

7f„ (xu n ) = Y2 (32) 

where E [Ab | RT :jV ] = NW^. Hence 7f„ (aq : „) is an unbiased approximation of 7r„ (aq ;n ). 

Improved unbiased resampling schemes have been proposed in the literature. These are methods of selecting 
Ab such that the unbiasedness property is preserved, and such that ¥ [Ab | I¥* :JV ] is smaller than that 
obtained via the multinomial resampling scheme described above. To summarize, the three most popular 
algorithms found in the literature are, in descending order of popularity/efficiency: 

Systematic Resampling Sample Ui ~ U [0, and define U, = U\ + for i = 2,.... N. then set 
Ab = ||[/j : J2 l k=\ Wn < Uj < EUi ^n}| the convention X^fc=i := 0- It i s straightforward to 
establish that this approach is unbiased. 

Residual Resampling Set Ab = (AT¥*J, sample A^," V from a multinomial of parameters (^ N, 

where W n oc IT^ — ¥ _1 A/^ then set = N^+ N n . This is very closely related to breaking the 
empirical CDF up into N components and then sampling once from each of those components: the 
stratified resampling approach of [7]. 

Multinomial Resampling Sample N^ :N from a multinomial of parameters (¥, IT r | :,v ) . 


Note that it is possible to sample efficiently from a multinomial distribution in O (N) operations. However, 
the systematic resampling algorithm introduced in [25] is the most widely-used algorithm in the literature 
as it is extremely easy to implement and outperforms other resampling schemes in most scenarios (although 
this is not guaranteed in general [13]). 

Resampling allows us to obtain samples distributed approximately according to n n (aq :n ), but it should be 
clear that if we are interested in estimating I n (y>„) then we will obtain an estimate with lower variance 
using 7f„ (xi :n ) than that which we would have obtained by using 7f„ {x\ :n ). By resampling we indeed add 
some extra “noise”— as shown by [9]. However, an important advantage of resampling is that it allows us 
to remove of particles with low weights with a high probability. In the sequential framework in which we 
are interested, this is extremely useful as we do not want to carry forward particles with low weights and 
we want to focus our computational efforts on regions of high probability mass. Clearly, there is always 
the possibility than a particle having a low weight at time n could have an high weight at time n + 1, in 
which case resampling could be wasteful. It is straightforward to consider artificial problems for which this 


13 





is the case. However, we will show that in the estimation problems we are looking at the resampling step is 
provably beneficial. Intuitively, resampling can be seen to provide stability in the future at the cost of an 
increase in the immediate Monte Carlo variance. This concept will be made more precise in section 3.6. 


3.5 A Generic Sequential Monte Carlo Algorithm 

SMC methods are a combination of SIS and resampling. At time 1, we compute the IS approximation (aq) 
of 7Ti (aq) which is a weighted collection of particles { W{, X \}. Then we use a resampling step to eliminate 
(with high probability) those particles with low weights and multiply those with high weights. We denote by 
| , X\ | the collection of equally-weighted resampled particles. Remember that each original particle X\ has 

N{ offspring so there exist N[ distinct indexes ji 7^ 32 7^ ■ ■ ■ ffi j N . such that X ^ = X ^ = ■ ■ ■ = X™ 1 = Xf. 
After the resampling step, we follow the SIS strategy and sample X\ ~ \ X\). Thus is 

approximately distributed according to 7Ti (aq) q -2 (.'£'21 aq). Hence the corresponding importance weights in 
this case are simply equal to the incremental weights c*2 (aq^)- We then resample the particles with respect 
to these normalised weights and so on. To summarise, the algorithm proceeds as follows (this algorithm is 
sometimes referred to as Sequential Importance Resampling (SIR) or Sequential Importance Sampling and 
Resampling (SIS/R)). 


Sequential Monte Carlo 


At time n = 1 

• Sample X\ ~ gi(xi). 

• Compute the weights uq (X|) and W[ oc uq (Xj). 

• Resample {W(, X{} to obtain N equally-weighted particles j X N , X\ j . 

At time n> 2 

• Sample X* ~ q n (x n \xi n ^ and set X{. n <- (x^X*) . 

• Compute the weights a n (X| :?l ) and W l n oc a n (X|. n ). 

• Resample to obtain N new equally-weighted particles |^,X* 1:n |. 


At any time n, this algorithm provides two approximations of 7r n (xi :n ). We obtain 

N 

*n(Xl:„) = Y, W n S Xi :n M 


after the sampling step and 


(33) 


(34) 
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after the resampling step. The approximation (33) is to be preferred to (34). We also obtain an approximation 
of i through 

i N 

As we have already mentioned, resampling has the effect of removing particles with low weights and multi¬ 
plying particles with high weights. However, this is at the cost of immediately introducing some additional 
variance. If particles have unnormalised weights with a small variance then the resampling step might be 
unnecessary. Consequently, in practice, it is more sensible to resample only when the variance of the unnor¬ 
malised weights is superior to a pre-specified threshold. This is often assessed by looking at the variability 
of the weights using the so-called Effective Sample Size (ESS) criterion [30, pp. 35-36], which is given (at 
time n) by 

®=(ew’) ■ 

Its interpretation is that in a simple IS setting, inference based on the N weighted samples is approximately 
equivalent (in terms of estimator variance) to inference based on ESS perfect samples from the target 
distribution. The ESS takes values between 1 and N and we resample only when it is below a threshold 
Nr', typically Nr = N/2. Alternative criteria can be used such as the entropy of the weights { W r ‘} which 
achieves its maximum value when W£ = In this case, we resample when the entropy is below a given 
threshold. 


Sequential Monte Carlo with Adaptive Resampling 


At time n = 1 

• Sample X\ ~ qi(xi). 

• Compute the weights w\ (Xj) and W[ oc w\ (Xj). 

• If resampling criterion satisfied then resample {W{,X[} to obtain N equally weighted particles {-j^,X] j and 
set {iT], X] } <- {^, A]}, otherwise set { W \, X\ j <- {Wf, Xj}. 

At time n> 2 

• Sample X' n ~ q n (x n I^L-i) and set X[, n (x],,, ,.A') . 

• Compute the weights a n (X| :n ) and oc W 7 n _ 1 a n (XJ. n ). 

• If resampling criterion satisfied, then resample {IT^, X\. n } to obtain N equally weighted particles |^,X* 1:n | 

and set {wt,^} otherwise set {iT^X^} {W^X^}. 
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In this context too we have two approximations of 7r n (xi :n ) 

N 

Kn (*1:4 = W n S Xl n (*l:n) - (35) 

i= 1 
N 

i=l 

which are equal if no resampling step is used at time n. We may also estimate Z n jZ n -1 through 

^ = ( 36 ) 
n 1 i= 1 

SMC methods involve systems of particles which interact (via the resampling mechanism) and, consequently, 
obtaining convergence results is a much more difficult task than it is for SIS where standard results (iid 
asymptotics) apply. However, there are numerous sharp convergence results available for SMC; see [10] for 
an introduction to the subject and the monograph of Del Moral [11] for a complete treatment of the subject. 
An explicit treatment of the case in which resampling is performed adaptively is provided by [12]. 

The presence or absence of degeneracy is the factor which most often determines whether an SMC algorithm 
works in practice. However strong the convergence results available for limitingly large samples may be, 
we cannot expect good performance if the finite sample which is actually used is degenerate. Indeed, some 
degree of degeneracy is inevitable in all but trivial cases: if SMC algorithms are used for sufficiently many 
time steps every resampling step reduces the number of unique values representing X -\, for example. For this 
reason, any SMC algorithm which relies upon the distribution of full paths X\ :n will fail for large enough 
n for any finite sample size, N, in spite of the asymptotic justification. It is intuitive that one should 
endeavour to employ algorithms which do not depend upon the full path of the samples, but only upon 
the distribution of some finite component x n -L-. n for some fixed L which is independent of n. Furthermore, 
ergodicity (a tendency for the future to be essentially independent of the distant past) of the underlying 
system will prevent the accumulation of errors over time. These concepts are precisely characterised by 
existing convergence results, some of the most important of which are summarised and interpreted in section 
3.6. 

Although sample degeneracy emerges as a consequence of resampling, it is really a manifestation of a deeper 
problem — one which resampling actually mitigates. It is inherently impossible to accurately represent a 
distribution on a space of arbitrarily high dimension with a sample of fixed, finite size. Sample impover¬ 
ishment is a term which is often used to describe the situation in which very few different particles have 
significant weight. This problem has much the same effect as sample degeneracy and occurs, in the absence 
of resampling, as the inevitable consequence of multiplying together incremental importance weights from 
a large number of time steps. It is, of course, not possible to circumvent either problem by increasing the 
number of samples at every iteration to maintain a constant effective sample size as this would lead to an 
exponential growth in the number of samples required. This sheds some light on the resampling mechanism: 
it “resets the system” in such a way that its representation of final time marginals remains well behaved at 
the expense of further diminishing the quality of the path-samples. By focusing on the fixed-dimensional 
final time marginals in this way, it allows us to circumvent the problem of increasing dimensionality. 


3.6 Convergence Results for Sequential Monte Carlo Methods 

Here, we briefly discuss selected convergence results for SMC. We focus on the CLT as it allows us to clearly 
understand the benefits of the resampling step and why it “works”. If multinomial resampling is used at 
every iteration 4 , then the associated SMC estimates of Z n jZ n and I n (tp n ) satisfy a CLT and their respective 

4 Similar expressions can be established when a lower variance resampling strategy such as residual resampling is used 
and when resampling is performed adaptively [12]. The results presented here are sufficient to guide the design of particular 
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asymptotic variances are given by 


N \J qi(x!) 7r fc _i(.a; 1: fe_i)gfc 


{ x k\ *l:fc-l) 


(37) 


I ””(*,) (f Vn '{Xl:n) Vn (*2:»| *l) <^2:n “ 4 (<Pn)) <&1 

(/ <Pn (a5l:n) T»* (ttfc+l:n| X 1:k ) dx k+1:n - I n (<£>„)) dx lu 


(38) 


+ /„ 




(<Pn (Xl :„) - 4 (</?„)) *Tn- 


A short and elegant proof of this result is given in [11, Chapter 9]; see also [9]. These expression are very 
informative. They show that the resampling step has the effect of “resetting” the particle system whenever 
it is applied. Comparing (26) to (37), we see that the SMC variance expression has replaced the importance 
distribution q n (x i : „) in the SIS variance with the importance distributions n k -i (xi :k -i) q k {x k \ x\-. k -i) 
obtained after the resampling step at time k — 1. Moreover, we will show that in important scenarios the 
variances of SMC estimates are orders of magnitude smaller than the variances of SIS estimates. 


Let us first revisit the toy example discussed in section 3.3. 


Example (continued). In this case, it follows from (37) that the asymptotic variance is finite only when 
a 2 > \ and 


compared to 


1 [ f n 2 „ (si) 
N J 


qi ( x i) 




q k (x k ) 



The asymptotic variance of the SMC estimate increases linearly with n in contrast to the exponential growth 
of the IS variance. For example, if we select a 2 = 1.2 then we have a reasonably good importance distribution 
as q k (x k ) « 7r n (x k ). In this case, we saw that it is necessary to employ TV « 2 x 10 23 particles in order 
to obtain Vl ^ T1 ] = io -2 for n = 1000. Whereas to obtain the same performance, VsM ^J^ Tt ] = io —2 , SMC 
requires the use of just N « 10 4 particles: an improvement by 19 orders of magnitude. 


This scenario is overly favourable to SMC as the target (31) factorises. However, generally speaking, the 
major advantage of SMC over IS is that it allows us to exploit the forgetting properties of the model under 
study as illustrated by the following example. 


Example. Consider the following more realistic scenario where 


In (Xl:n) = H (®l) Y[ Mk ( X *\II G k {%k) 
k =2 k =1 

with ii a probability distribution, M k a Markov transition kernel and G k a positive “potential” function. 
Essentially, filtering corresponds to this model with M k (x k \ x k -\) = / {x k \x k -i) and the time inhomoge¬ 
neous potential function G k (x k ) = g(y k \x k )- In this case, n k (x k \x\ :k -i) = n k (x k \x k ~i) and we would 

algorithms and the additional complexity involved in considering more general scenarios serves largely to produce substantially 
more complex expressions which obscure the important points. 
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typically select an importance distribution q k ( x k \ X\ :k -i) with the same Markov property qk {x k \Xuk-i) = 
q k (x fc |x fc _i). It follows that (37) is equal to 


N \J qi(xi) “J n k -i(x k -i 


7T„ (Xk-l:k) 


qk {xk\xk~_ 


and (38), for ip n (x 1: „) = (p ( x n ), equals: 


/ ffrj (f V ( x n)n n (* n | ®i) dx 2:n - 4 (y>)) 2 d Xl 

+ Sf=2 / (f V ^ (*»!**) dX fc - (<p)) 2 d®fc_l:fc 

+ / ^-^xllognlxilxn-o ^ (*») “ In (^)) 2 

where we use the notation I n (<p) for I n (<p n ). In many realistic scenarios, the model associated with 7r n (#i :n ) 
has some sort of ergodic properties; i.e. \/x k ,x' k e X n n (x n \ x k ) as 7r n (x n \ x' k ) for large enough n — k. In 
layman’s terms, at time n what happened at time k is irrelevant if n — k is large enough. Moreover, this 
often happens exponentially fast; that is for any ( x k ,x' k ) 

\ J kn {x„\ x k ) - 7r„ (x n \ x k )\dx n < P n ~ k 

for some /? < 1. This property can be used to establish that for bounded functions <p < ||<p|| 

|/ V 5 ( X n) K n {x n \ X k ) dx n — I (<p)| < P n ~ k |MI 
and under weak additional assumptions we have 

Kn {x k -. 


Tr k -i(x k -i)q k (x k \x k - 
for a finite constant A. Hence it follows that 


Vsmc ffn] ^ c-n 

Zl “ “AT’ 

Vsmc [in (<p)\ < 

for some finite constants C,D that are independent of n. These constants typically increase polynomi- 
ally/exponentially with the dimension of the state-space X and decrease as (3 —)• 0. 


3.7 Summary 

We have presented a generic SMC algorithm which approximates {7r rt (#i :n )} and {Z n } sequentially in time. 


• Wherever it is possible to sample from q n (x n Xi :re _i) and evaluate a n (xi :n ) in a time independent of 
n, this leads to an algorithm whose computational complexity does not increase with n. 

• For any k, there exists n > k such that the SMC approximation of n n {x\- k ) consists of a single particle 
because of the successive resampling steps. It is thus impossible to get a “good” SMC approximation 
of the joint distributions {7 t„ (xi :n )} when n is too large. This can easily be seen in practice, by 
monitoring the number of distinct particles approximating 7r„ (aq). 
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• However, under mixing conditions, this SMC algorithm is able to provide estimates of marginal distri¬ 
butions of the form 7r„ (x n -L+i-. n ) and estimates of Z n /Z n _i whose asymptotic variance is uniformly 
bounded with n. This property is crucial and explains why SMC methods “work” in many realistic 
scenarios. 

• Practically, one should keep in mind that the variance of SMC estimates can only expected to be 
reasonable if the variance of the incremental weights is small. In particular, this requires that we 
can only expect to obtain good performance if 7r n (a:i :n _i) ~ n n -i (xi-.n-i) and q n {x n \ x\ :n -\) w 
7T»C*jt|#i.-n-i); that is if the successive distributions we want to approximate do not differ much 
one from each other and the importance distribution is a reasonable approximation of the “optimal” 
importance distribution. However, if successive distributions differ significantly, it is often possible to 
design an artificial sequence of distributions to “bridge” this transition [21, 31]. 


4 Particle Filtering 


Remember that in the filtering context, we want to be able to compute a numerical approximation of 
the distribution {p(x i :n | y-\sequentially in time. A direct application of the SMC methods described 
earlier to the sequence of target distributions 7r n {x\- n ) = p(x\ :n \ yi :n ) yields a popular class of particle filters. 
More elaborate sequences of target and proposal distributions yield various more advanced algorithms. For 
ease of presentation, we present algorithms in which we resample at each time step. However, in practice we 
recommend only resampling when the ESS is below a threshold and employing the systematic resampling 
scheme. 
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4.1 SMC for Filtering 


First, consider the simplest case in which 7 „ (aq : „) = p ( X \ :n , y\- n ) is chosen, yielding n n (aq : „) = p{x\- n \ yi :n ) 
and Z n = p (yi: n ). Practically, it is only necessary to select the importance distribution q n (x n \ xi :n _i). We 
have seen that in order to minimise the variance of the importance weights at time n, we should select 
<?° pt ( x n \ x\ :n -\) = 7r„ ( x n \ x\ :n -\) where 


n n {x n |ari :n _i) =p(x n \y n ,x n - 1 ) 

_ g(l/n|a;n)/(a; n |a; n -i) 

p(Vn |*n-l) ’ [ } 

and the associated incremental importance weight is a n (xi :n ) = p(y n \ £ n -i) • In many scenarios, it is not 
possible to sample from this distribution but we should aim to approximate it. In any case, it shows that 
we should use an importance distribution of the form 

<1,, (.'■'» = '/(•'» V„- 1) (40) 


and that there is nothing to be gained from building importance distributions depending also upon (yi :n -i,xi-. n 
— although, at least in principle, in some settings there may be advantages to using information from sub¬ 
sequent observations if they are available. Combining (29), (30) and (40), the incremental weight is given 
by 


a n {xi:n) = Ojn { x n-l:n) 


g(yn\x n )f(x„ |z»-l) 

q{x n \y n ,x n -\) 


The algorithm can thus be summarised as follows. 


SIR/SMC for Filtering 

At time n = 1 

• Sample X\ ~ q(x 1 \y 1 ). 

• Compute the weights w\ (Xf) = and W{ oc w\ ( X \). 

• Resample { W{,X {} to obtain N equally-weighted particles • 

At time n> 2 

• Sample X' n ~ q{x n \ y nr l&-i) and set X[. n <- (Xt :n _ x , AT*) • 

• Compute the weights a n (X 1 . ) = g ( j/rt | jY "M jY "l A "- 1 ) anc j w* oc a (X* ,, ) 

• Resample {W^,X|. n } to obtain N new equally-weighted particles |^,W* 1:n |. 


We obtain at time n 

N 

P(X 1:„| yi:n) = ^ (^l:n) , 
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SV Model: SIS Filtering Estimates 



Figure 2: Filtering estimates obtained for the stochastic volatility model using SIS. At each time the mean 
and standard deviation of x n conditional upon t/i :n is estimated using the particle set. It initially performs 
reasonably, but the approximation eventually collapses: the estimated mean begins to diverge from the truth 
and the estimate of the standard deviation is inaccurately low. 

N 

p{ Vn\ yi:n-t) = J2 W n-l a n ( X Ll:n) • 

However, if we are interested only in approximating the marginal distributions {p {x n \yi :n )} and {p (f/i :n )} 
then we need to store only the terminal-value particles {A), to be able to compute the weights: the 
algorithm’s storage requirements do not increase over time. 

Many techniques have been proposed to design “efficient” importance distributions q {x n \ y n , x n -i) which 
approximate p(x n \ y n ,x n _ i). In particular the use of standard suboptimal filtering techniques such as the 
Extended Kalman Filter or the Unscented Kalman Filter to obtain importance distributions is very popular 
in the literature [14, 37]. The use of local optimisation techniques to design q ( x n y n ,x n - 1) centred around 
the mode of p (x n \ y n , x n -\) has also been advocated [33, 34]. 


4.1.1 Example: Stochastic Volatility 

Returning to example 4 and the simulated data shown in figure 1, we are able to illustrate the performance 
of SMC algorithms with and without resampling steps in a filtering context. 

An SIS algorithm (corresponding to the above SMC algorithm without a resampling step) with N = 1000 
particles, in which the conditional prior is employed as a proposal distribution (leading to an algorithm in 
which the particle weights are proportional to the likelihood function) produces the output shown in figure 
2. Specifically, at each iteration, n, of the algorithm the conditional expectation and standard deviation of 
x n , given is obtained. It can be seen that the performance is initially good, but after a few iterations 
the estimate of the mean becomes inaccurate, and the estimated standard deviation shrinks to a very small 
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Figure 3: Empirical distributions of the particle weights obtained with the SIS algorithm for the stochastic 
volatility model at iterations 2, 10 and 50. Although the algorithm is reasonably initialised, by iteration 10 
only a few tens of particles have significant weight and by iteration 50 a single particle is dominant. 


value. This standard deviation is an estimate of the standard deviation of the conditional posterior obtained 
via the particle filter: it is not a measure of the standard deviation of the estimator. Such an estimate can 
be obtained by considering several independent particle filters run on the same data and would illustrate 
the high variability of estimates obtained by a poorly-designed algorithm such as this one. In practice, 
approximations such as the effective sample size are often used as surrogates to characterise the uncertainty 
of the filter estimates but these perform well only if the filter is providing a reasonable approximation of the 
conditional distributions of interest. Figure 3 supports the theory that the failure of the algorithm after a 
few iterations is due to weight degeneracy, showing that the number of particles with significant weight falls 
rapidly. 

The SIR algorithm described above was also applied to this problem with the same proposal distribution and 
number of particles as were employed in the SIS case. For simplicity, multinomial resampling was applied 
at every iteration. Qualitatively, the same features would be observed if a more sophisticated algorithm 
were employed, or adaptive resampling were used although these approaches would lessen the severity of 
the path-degeneracy problem. Figure 4 shows the distribution of particle weights for this algorithm. Notice 
that unlike the SIS algorithm shown previously, there are many particles with significant weight at all three 
time points. It is important to note that while this is encouraging it is not evidence that the algorithm 
is performing well: it provides no information about the path-space distribution and, in fact, it is easy to 
construct poorly-performing algorithms which appear to have a good distribution of particle weights (for 
instance, consider a scenario in which the target is relatively flat in its tails but sharply concentrated about 
a mode; if the proposal has very little mass in the vicinity of the mode then it is likely that a collection of 
very similar importance weights will be obtained — but the sample thus obtained does not characterise the 
distribution of interest well). Figure 5 shows that the algorithm does indeed produce a reasonable estimate 
and plausible credible interval. And, as we expect, a problem does arise when we consider the smoothing 
distributions p(x n \yi : $ oo) as shown in figure 6: the estimate and credible interval is unreliable for n <C 500. 
This is due to the degeneracy caused at the beginning of the path by repeated resampling. In contrast the 
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n=2 n=10 n=50 



Figure 4: Empirical distribution of particle weights for an SIR algorithm applied to the stochastic volatility 
model. Notice that there is no evidence of weight degeneracy in contrast to the SIS case. This comes at the 
cost of reducing the quality of the path-space representation. 


smoothed estimate for n w 500 (not shown) is reasonable. 


4.2 Auxiliary Particle Filtering 

As was discussed above, the optimal proposal distribution (in the sense of minimising the variance of im¬ 
portance weights) when performing standard particle filtering is q(x n \ y n ,x n - 1) = p(x n \ y n , x n -i). Indeed, 
a n (x n -i :n ) is independent of x n in this case so it is possible to interchange the order of the sampling and 
resampling steps. Intuitively, this yields a better approximation of the distribution as it provides a greater 
number of distinct particles to approximate the target. This is an example of a general principle: resampling, 
if it is to be applied in a particular iteration, should be performed before, rather than after, any operation 
that doesn’t influence the importance weights in order to minimise the loss of information. 

It is clear that if importance weights are independent of the new state and the proposal distribution corre¬ 
sponds to the marginal distribution of the proposed states then weighting, resampling and then sampling 
corresponds to a reweighting to correct for the discrepancy between the old and new marginal distribution 
of the earlier states, resampling to produce an unweighted sample and then generation of the new state from 
its conditional distribution. This intuition can easily be formalised. 

However, in general, the incremental importance weights do depend upon the new states and this straightfor¬ 
ward change of order becomes impossible. In a sense, this interchange of sampling and resampling produces 
an algorithm in which information from the next observation is used to determine which particles should 
survive resampling at a given time (to see this, consider weighting and resampling occurring as the very last 
step of the iteration before the current one, rather than as the first step of that iteration). It is desirable to 
find methods for making use of this future information in a more general setting, so that we can obtain the 
same advantage in situations in which it is not possible to make use of the optimal proposal distribution. 
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SV Model: SIR Filtering Estimates 



Figure 5: SIR Filtering estimates for the SV model. 


SV Model: SIR Smoothing Estimates 



Figure 6: SIR Smoothing estimates for the SV model. 
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The Auxiliary Particle Filter (APF) is an alternative algorithm which does essentially this. It was originally 
introduced in [33] using auxiliary variables — hence its name. Several improvements were proposed to 
reduce its variance [7, 34]. We present here the version of the APF presented in [7] which only includes one 
resampling step at each time instance. It has long been realised that, experimentally, this version outperforms 
the original two stage resampling algorithm proposed in [33] and is widely used; see [7] for a comparison of 
both approaches. The APF is a look ahead method where at time n we try to predict which samples will be 
in regions of high probability masses at time n+ 1. 

It was shown in [23] that the APF can be reinterpreted as a standard SMC algorithm applied to the following 
sequence of target distributions 


7n(*l:n) = P (Xl:n, Vl-.n) P ( Vn+1 \ X„) (41) 

with p(y n + i\x n ) chosen as an approximation of the predictive likelihood p(y n + i\x n ) if it is not known 
analytically. It follows that n n ( x \._ n ) is an approximation of p(x i :n | j/i :n +i) denoted p(xi :n \y\-. n +\) given by 

TTn On ;n ) = p ( x 1:n \ y 1:n+1 ) a p ( x 1:n \ y 1:n ) p ( y n+1 \ x n ) (42) 

In the APF we also use an importance distribution q n {x n \x\- n -\) of the form (40) which is typically 
an approximation of (39). Note that (39) is different from 7r n (x n \ xi :n -i) in this scenario. Even if we 
could sample from n n (x n \ x\ :n -\), one should remember that in this case the object of inference is not 
7r„ (xi-. n ) = P{xi :n \yi :n+ i) but p(x i, n | y\ :n ). The associated incremental weight is given by 

, X In (zi:n) 

OLn {Xn-V.n) = -7-T- 7~ [-T 

ln-l(Xl:n-l)qn{Xn\Xl:n-l) 

= g{yn\x n ) f {x n \x n _ 1 )p{y n+1 \x n ) 
p{y n \x n -i)q {x n \y n ,x n -{) 


To summarize, the APF proceeds as follows. 


Auxiliary Particle Filtering 


At time n = 1 

• Sample X\ ~ q(x 1 \y 1 ). 

• Compute the weights wi (X\) = M1 ) gV21 Xl ) and W{ oc wt (A|). 

• Resample {W[, X\] to obtain N equally-weighted particles . 

At time n> 2 

• Sample X l n ~ g(a; n \ pn^n-i) and set X{, n t- . 

• Compute the weights a n (X*_ lsB ) = ^ V ~[^ ^ ^ K 


• Resample {W^,X j. n } to obtain N new equally-weighted particles 


W-iJ- 
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Keeping in mind that this algorithm does not approximate the distributions {p (xi :n \yi :n )} but the distri¬ 
butions {p(x i : „| yi-.n+i)}, we use IS to obtain an approximation of p(x\ :n \ y\- n ) with 

7Tn-l {Xv.n-l) Qn ( X n \= f {xx: n -l\yx,n) Q (x„\y„, X„_i) 

as the importance distribution. A Monte Carlo approximation of this importance distribution is obtained 
after the sampling step in the APF and the associated unnormalised importance weights are given by 

~ / x = p{xi,n,yi-.n) g{yn\x n ) f {x n \x n ^) 

1: " ln-1 (xi:„_i)£? n (x n |xi:„_i) p ( y n \ X n - 1) q ( X n \ Vn, X n - 1) ' 

It follows that we obtain 

N _ 

p{xi: n \yi:n) = W n 5 X{. n (%:*!* 

p{Vn\VUn l)’ = | ljj l)^ 

where oc w n (X^_ Vn ) or W™ oc if resampling was not performed at the end of the 

previous iteration. Selecting q n (x n \x\ :n -x) = p(x n \y n ,x n -i) and p(y n \x n -i) = p{y n \xn-i), when it 
is possible to do so, leads to the so-called “perfect adaptation” case [33]. In this case, the APF takes a 
particularly simple form as a n ( x n -i :n ) = p (y n \ x n -i) and w n (x n -i :n ) = 1. This is similar to the algorithm 
discussed in the previous subsection where the order of the sampling and resampling steps is interchanged. 

This simple reinterpretation of the APF shows us several things: 


• We should select a distribution p(x\- n \ y\- n ) with thicker tails than p(x i,„| y\ :n ). 

• Setting p (y n \ x n -i) = g (y n \p ( x n -i )) where y denotes some point estimate is perhaps unwise as this 
will not generally satisfy that requirement. 

• We should use an approximation of the predictive likelihood which is compatible with the model we 
are using in the sense that it encodes at least the same degree of uncertainty as the exact model. 


These observations follows from the fact that p{x\- n \y\- n ) is used as an importance distribution to esti¬ 
mate p(x i,„| yi: n ) and this is the usual method to ensure that the estimator variance remains finite. Thus 
p(y n | x n -\) should be more diffuse than p (y n \ x n -±). 

It has been suggested in the literature to set p (y n \ x n -\) = g (y n \ y (x n -i)) where y{x n -\) corresponds 
to the mode, mean or median of f (x n \x n -i). However, this simple approximation will often yield an 
importance weight function (43) which is not upper bounded on X x X and could lead to estimates with a 
large/infinite variance. 

An alternative approach, selecting an approximation p(y n ,x n \x n -\) = p(y n \x n -i)q{x n \y n ,x n -i) of the 
distribution p (y n ,x n \ x n -\) = p (y n \ x n -\)p {x n \ y n ,x n - 1) = g(y n \x n ) / ( x n \x n -i) such that the ratio (43) 
is upper bounded on X x X and such that it is possible to compute p(y n \ x n _-|) pointwise and to sample 
from q (x n \ y n ,x n - 1), should be preferred. 


4.3 Limitations of Particle Filters 

The algorithms described earlier suffer from several limitations. It is important to emphasise at this point 
that, even if the optimal importance distribution p(x n \ y n ,x n - i) can be used, this does not guarantee that 
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the SMC algorithms will be efficient. Indeed, if the variance of p (y n \ x n -\) is high, then the variance of the 
resulting approximation will be high. Consequently, it will be necessary to resample very frequently and the 
particle approximation p(aq ;n | y-\ :n ) of the joint distribution p {x-\ :n \ y\ :n ) will be unreliable. In particular, 
for k -C n the marginal distribution p{X\- k \ Vi-.n) will only be approximated by a few if not a single unique 
particle because the algorithm will have resampled many times between times k and n. One major problem 
with the approaches discussed above is that only the variables { X l n } are sampled at time n but the path 
values remain fixed. An obvious way to improve upon these algorithms would involve not only 

sampling {A - *} at time n, but also modifying the values of the paths over a fixed lag {X '' n _ L+l . n _- [ } for 
L > 1 in light of the new observation y n ; L being fixed or upper bounded to ensure that we have a sequential 
algorithm (i.e. one whose computational cost and storage requirements are uniformly bounded over time). 
The following two sections describe two approaches to limit this degeneracy problem. 


4.4 Resample-Move 

This degeneracy problem has historically been addressed most often using the Resample-Move algorithm 
[20]. Like Markov Chain Monte Carlo (MCMC), it relies upon Markov kernels with appropriate invariant 
distributions. Whilst MCMC uses such kernels to generate collections of correlated samples, the Resample- 
Move algorithm uses them within an SMC algorithm as a principled way to “jitter” the particle locations 
and thus to reduce degeneracy. A Markov kernel K n {x\ :n \ x\ :n ) of invariant distribution p (Xi-. n \ y\ :n ) is a 
Markov transition kernel with the property that 

J P(Xl:n\yi-.n)K n (x' 1 . n \x 1 -. n )dx 1:n = p ( x' 1:n \ y 1:n ) . 

For such a kernel, if X\ :n ~ p(x\ :n \y\ :n ) and ~ K (x\ :n \ Xi ;n ) then X[. n is still marginally 

distributed according to p(x\- n \ y\- n ). Even if X%. n is not distributed according to p(x i :n | y\ :n ) then, after 
an application of the MCMC kernel, X' 1:n can only have a distribution closer than that of X 1:n (in total 
variation norm) to p (x\. n \ yi- n ). A Markov kernel is said to be ergodic if iterative application of that kernel 
generates samples whose distribution converges towards p {xi :n \ y\- n ) irrespective of the distribution of the 
initial state. 

It is easy to construct a Markov kernel with a specified invariant distribution. Indeed, this is the basis of 
MCMC — for details see [35] and references therein. For example, we could consider the following kernel, 
based upon the Gibbs sampler: set x' 1:n _ L = x\- n - k then sample x' n _ L+1 fromp (x n -t+i\ yi-. n , x' 1 . n _ L ,Xf h ^L+ 2 -.n), 
sample x' n _ L+2 from p ( x n 2 yv.m x i :n -L+i’ x n-L+3-n) and so on until we sample x' n from p (a;„| */i, a/,): 
that is 

K n (x' 1:n \x 1:n ) = S Xl:n _ L (x' 1:n _ L ) p{x' k \y 1:n ,x' 1 . k _ l ,x k+ i, n ) 

k=n— L+l 

and we write, with a slight abuse of notation, the non-degenerate component of the MCMC kernel K n (| x\ 
It is straightforward to verify that this kernel is p(a;i :n | yi : „)-invariant. 

If it is not possible to sample fromp ( x k \ yi: n ,x' 1 . k _ 1 ,x k+ i :n ) = p [x' k \ y k , x' k ^ t , x k+ \), we can instead employ 
a Metropolis-Hastings (MH) strategy and sample a candidate according to some proposal q ( x' k y k , x' k _ 1 ,x k:k+ i) 
and accept it with the usual MH acceptance probability 

min L n \ Ulfn) g {x k \yk,x' k _ v x' h ,x k+1 ) \ 

\ ' p{x\. k _i,x k+v . n \y 1:n ) q(x' k \y k ,x' k _ l ,x k:k+1 )) 

= min L 9 ( Vk I x' k ) f ( x k+1 1 x' k ) f ( x' k \ x' k _ x ) q(x k \ y k , x' k _ 1 , x' k ,x k+1 ) \ 

V g(yk\x k )f(x k+1 \x k )f(x k \x' k _ 1 )q(x' k \y k ,x' k _ 1 ,x k:k+1 ) J 

It is clear that these kernels can be ergodic only if L = n and all of the components of x\ :n are updated. 
However, in our context we will typically not use ergodic kernels as this would require sampling an increasing 
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number of variables at each time step. In order to obtain truly online algorithms, we restrict ourselves to 
updating the variables X n _L + \. n for some fixed or bounded L. 

The algorithm proceeds as follows, with K n denoting a Markov kernel of invariant distribution p{xi-. n \y\- n ). 


SMC Filtering with MCMC Moves 


At time n = 1 

• Sample X\ ~ q(x 1 \y 1 ). 

• Compute the weights w\ (Xj) = ar| d W? oc w\ {X\). 

• Resample {W{,X{} to obtain N equally-weighted particles j AT', j. 

• Sample X'j ~ Ki(xi\^). 

At time 1 < n < L 

• Sample X l n ~ q(x n \y n ,X'^_ 1 ) and set X\. n <- . 

• Compute the weights a n (X^_ 1:n ) = ar| d oc a n ■ 

• Resample {W^,Xf. n } to obtain N equally-weighted particles |^,X* 1:n |. 

• Sample X'l n ~ K n (x 1 :n \X[. n ). 

At time n> L 

• Sample X l n ~ q(x n \y n , and set X\ :n ^X,') . 

• Compute the weights a n (X^_ 1:n ) = anc j w» <x a n (X£_ 1:n ) . 

• Resample {W^,Xj. n } to obtain N new equally-weighted particles 

• Sample X*_ i+1:n ~ K n {x n ^ L+l n \X\, n ) and set X[\ n «- f^ n -i.^L+i:n) • 


The following premise, which [35] describes as “generalised importance sampling”, could be used to justify 
inserting MCMC transitions into an SMC algorithm after the sampling step. Given a target distribution 7r, 
an instrumental distribution y and a 7r-invariant Markov kernel, K. the following generalisation of the IS 
identity holds: 

J Tt{y)v{y)dy = JJ y{x)K(y\x ) ^ v{y)dxdy 

for any Markov kernel L. This approach corresponds to importance sampling on an enlarged space using 
y(x)K(y | x) as the proposal distribution for a target ir(y)L(x\ y) and then estimating a function ip'{x, y) = 


28 




c p(y ). In particular, for the time-reversal kernel associated with K 


L(x\y) = 


n(x)K{y\x) 

*{v) 


we have the importance weight 

n(y)L(x\y) = n(x) 
n(x)K(y\x) p(x)' 

This interpretation of such an approach illustrates its deficiency: the importance weights depend only upon 
the location before the MCMC move while the sample depends upon the location after the move. Even if the 
kernel was perfectly mixing, leading to a collection of iid samples from the target distribution, some of these 
samples would be eliminated and some replicated in the resampling step. Resampling before an MCMC step 
will always lead to greater sample diversity than performing the steps in the other order (and this algorithm 
can be justified directly by the invariance property). 

Based on this reinterpretation of MCMC moves within IS, it is possible to reformulate this algorithm as 
a specific application of the generic SMC algorithm discussed in Section 3. To simply notation we write 
q n (x n \x n -i) for q(x n \y n ,x n -i). To clarify our argument, it is necessary to add a superscript to the 
variables; e.g. X% corresponds to the p th time the random variable X is sampled; in this and the following 
section, this superscript does not denote the particle index. Using such notation, this algorithm is the generic 
SMC algorithm associated to the following sequence of target distributions 

n n (x{ L +\...,x^l i y^ L: _ x ^) 

= p(x^\....,x^i +1 ,x^_ L ,...,x 2 n \y^ 
x---xL 2 (x 2 1 ,x 1 2 \xl,x 2 2 )L 1 (x 1 1 \xl) 

where L n is the time-reversal kernel associated with K n whereas, if no resampling is used 5 , a path up to 
time n is sampled according to 

= gi (x\) Ki (a?i| x\) q 2 (x\\ xf) K 2 {x\,x\\x\,x\) 

x — • x 9n {x\\ a^-i) K n (x^+l +1 , ■ ■ ■ ,a: 5 _ 1 ,a:^| x^^_ L ,x ^_ L+1 ,... . 

This sequence of target distributions admits the filtering distributions of interest as marginals. The clear 
theoretical advantage of using MCMC moves is that the use of even non-ergodic MCMC kernels {K n } can 
only improve the mixing properties of {n n } compared to the “natural” sequence of filtering distributions; 
this explains why these algorithms outperform a standard particle filter for a given number of particles. 

Finally, we note that the incorporation of MCMC moves to improve sample diversity is an idea which is 
appealing in its simplicity and which can easily be incorporated into any of the algorithms described here. 


4.5 Block Sampling 

The Resample-Move method discussed in the previously section suffers from a major drawback. Although 
it does allow us to reintroduce some diversity among the set of particles after the resampling step over a 
lag of length L, the importance weights have the same expression as for the standard particle filter. So this 
strategy does not significantly decrease the number of resampling steps compared to a standard approach. 
It can partially mitigate the problem associated with resampling, but it does not prevent these resampling 
steps in the first place. 

5 Once again, similar expressions can be obtained in the presence of resampling and the technique remains valid. 
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An alternative block sampling approach has recently been proposed in [15]. This approach goes further than 
the Resample-Move method, which aims to sample only the component x n at time n in regions of high 
probability mass and then to uses MCMC moves to rejuvenate x n -L+i-. n after a resampling step. The block 
sampling algorithm attempts to directly sample the components x n _L+i-.n at time n; the previously-sampled 
values of the components .r,,./.^ n - i sampled are simply discarded. In this case, it can easily be shown that 
the optimal importance distribution (that which minimises the variance of the importance weights at time 


P{x„- L - 


n\yn-h- 


p{x„-L:n,yn-L+l-.n) 

p{y n -L+\:n\x n -L) 


(44) 


where 

P{yn-L+l-.n\x n -L) = J / ( X k \ X k -l) ■ g ( J/fc \ X k ) dx n - L+1:n . 

J k=n-L +1 


(45) 


As in the standard case (corresponding to L = 1), it is typically impossible to sample from (44) and/or to 
compute (45). So in practice we need to design an importance distribution q(x n -L+i-.n\yn-L+i:n,x n -L) 
approximating p {x n _L + \ :n \ y n -L+i-.n, x n-L)- Henceforth, we consider the case where L > 1. 


The algorithm proceeds as follows. 


SMC Block Sampling for Filtering 


At time n = 1 

• Sample X\ ~ q(x 1 \y 1 ). 

• Compute the weights wi (Xj) = and W[ oc wi (A'f). 

• Resample {Wl,X[} to obtain N equally-weighted particles j ^,X\ j . 
At time 1 < n < L 

• Sample X\. n ~ g(xi :n | y 1:n ). 

• Compute the weights a n {X\ :n ) = ^ 1 . !n p ,n j and W,\ oc a n (AT^_ 1: „) . 


• Resample 

At time n> L 

• Sample X z n _ L+1:ri 

• Compute the weights 


-i)- 




q(x n - L+1:n \y n - L+1: 

X :n-L,X'n- L+ l:n: 


,yi:n) q{X l n - L +\-.n-\\l 


P (^l:n-l.J/l:n-l) q(X l n _ L+1:n \y n - L+l:n 


(46) 
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* Resample j W^,X 1:n _ L ,Xl l _ L+1:n > to obtain N new equally weighted particles 


When the optimal IS distribution is used q(x n -L+i:n\yn-L+i:n>x n -L) = p{x n -L+i-.n\ yn-E+pYn*®**—4* we 
obtain 

... ^ p(Xl^t,X n -L+Vn,yin) n-llVn-L+l J Xn—L} 

^n-L+l:nj — /_ \ / I _ v 

P yi:n-l) P {X n - L + I:n \ y n -L+l:n, %n-L) 

= P{y n \yn-L+l:n,X n - L ) ■ 


This optimal weight has a variance which typically decreases exponentially fast with L (under mixing as¬ 
sumptions). Hence, in the context of adaptive resampling, this strategy dramatically reduces the number 
of resampling steps. In practice, we cannot generally compute this optimal weight and thus use (46) with 
an approximation of p (x n _ L+1:n \ y n -L+i-.n,x n -L) for q( x n _ L+1:n \ y n -L+i:n,x n -L)- When a good approx¬ 
imation is available, the variance of (46) can be reduced significantly compared to the standard case where 
L = 1. 


Again, this algorithm is a specific application of the generic SMC algorithm discussed in Section 3. To simply 
notation we write q n (x n -L+ i-. n \ x n -L) for q(x n -L+i-. n \ Vn-L+i-.n > Zn-4- To clarify our argument, it is again 
necessary to add a superscript to the variables; e.g. X% corresponds to the p th time the random variable 
Xk is sampled; remember that in the present section, this superscript does not denote the particle index. 
Using this notation, the algorithm corresponds to the generic SMC algorithm associated with the following 
sequence of target distributions 

tt„ (44..., 4-l+i* 44+2* • • •, 4) 

= V (tffcn-i+l* 4-1+2* • • • >4! yi-n) Qn -1 (4-1+1* •••> 4-11 4-l) 

X • • • Q‘2 ( 4 : 4) Ql (4) • 

where, if no resampling is used, a path is sampled according to 
n (r 1:L r 1:L W 

Qn {X i , . . . ,X n _ L+ 1 ,X n _f.^2 i ...,X n ) 

= Qi (4) Q2 (4*4) X • • • X q n (4 ,. fl . —,41 4-x) • 

The sequence of distributions admits the filtering distributions of interest as marginals. The mixing properties 
of {7r„} are also improved compared to the ‘natural’ sequence of filtering distributions; this is the theoretical 
explanation of the better performance obtained by these algorithms for a given number of particles. 


4.6 Rao-Blackwellised Particle Filtering 


Let us start by quoting Trotter [36]: “A good Monte Carlo is a dead Monte Carlo”. Trotter specialised in 
Monte Carlo methods and did not advocate that we should not use them, but that we should avoid them 
whenever possible. In particular, whenever an integral can be calculated analytically doing so should be 
preferred to the use of Monte Carlo techniques. 


Assume, for example, that one is interested in sampling from tt (a;) with x = (u,z) gU x Z and 


44 = 7 ^ =tt(u)tt(z\u) 


where n(u) = Z 1 q (u) and 


n(z\u) 


7 (u, z) 
7(4 


admits a closed-form expression; e.g. a multivariate Gaussian. Then if we are interested in approximating 
7r ( x ) and computing Z, we only need to perform a MC approximation 7r (u) and Z = f 7 (it) du on the space 
U instead of UxZ. We give two classes of important models where this simple idea can be used successfully. 
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4.6.1 Conditionally linear Gaussian models 


Consider X — U x Z with Z = R"*. Here X n = ( U n , Z n ) where {U n } is a unobserved Markov process such 
that Ui ~ fj,u (ui), U n | U n -i ~ fu (u n | U n -i) and conditional upon {!/„} we have a linear Gaussian model 
with Zi\Ui ~A/'(0,Sf/j) and 


Z n = A c/n Z„_ 1 +H [/ „y jl , 
y n = + D Un W n 

where V n ' Af ((),/„„), W n ' Af (0,/„ ra ) and for any u eW {^4„, B U ,C U , D u } are matrices of appropriate 
dimensions. In this case we have g(x) = g(u,z) = pu (u) Af (x; 0,S Z ), f(x'\x) = f ((u',z')\(u,z)) = 
fu (u'\u)Af (z';A u >z,B u >B^,) and g(y\x ) = g(y\ ( u,z )) = AT (y; C u z, D U D^) . The switching state-space 
model discussed in Example 3 corresponds to the case where {U n } is a finite state-space Markov process. 

We are interested in estimating 

P ( «l:n, yi:n) = P («l:n| Vl-.n) «l:n) • 

Conditional upon {U n }, we have a standard linear Gaussian model {Z„j, {Y n }. Hence p{z\ :n \y\ :n ,u\ :n ) 
is a Gaussian distribution whose statistics can be computed using Kalman techniques; e.g. the marginal 
p(z n | yi-.n,u i:n) is a Gaussian distribution whose mean and covariance can be computed using the Kalman 
filter. It follows that we only need to use particle methods to approximate 


7n(«l:n) = P («l:n, Vl-.n) 

= P{Ul:n)p(yi:n\Ul:n) 

where p (u\ :n ) follows from the Markov assumption on {U n } and p(yi :n \ wi:n) is a marginal likelihood which 
can be computed through the Kalman filter. In this case, we have 

5° Pt {U n | Wn, = P(Un\yi:n, «!:n l) 

= P(yn\yi:n-l,Ul:n) fu (UnlUn-i) 

p(yn\yi:n-l,Ul:„-l) 

The standard SMC algorithm associated with j n (u\ :n ) and the sequence of IS distributions q(u n \y 1:n , u\- n -\) 
proceeds as follows. 


SMC for Filtering in Conditionally Linear Gaussian Models 

At time n 1 

• Sample U{ ~ qiu^yf). 

• Compute the weights w\ (C/f) = ^ and W7 oc wq (U[). 

• Resample {W[,U{} to obtain N equally-weighted particles j . 

At time n> 2 

• Sample W n ~q (u n | yi-.nM-.n-i) and set Uf n <- (l • 
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• Compute the weights a„ (f7|. n ) = p ( Vn ^ ^ ^ and W1 oc a n (£/{.„) . 

• Resample {W^,f/f. n } to obtain N new equally-weighted particles 


This algorithm provides also two approximations of p(«i :n | y\ :n ) given by 

N 

p(u 1:n \ y\-n) = Y W h S V' 1M ( U hn) * 

1 N 

P(U1 : „\ yv.n) = Y 5 u[. n 

and 

1 N 

p{y n \yi:n l) =^Y an ' 

At first glance, it seems that this algorithm cannot be implemented as it requires storing paths of in¬ 

creasing dimension so as to allow the computation oip{y n \ yi :n -i,ui:n) and sampling from q(u n | y\ :n , u( ;n l\). 
The key is to realise that p(y n \yi :n -i,ui :n ) is a Gaussian distribution of mean y n \ n -i («i :n ) and covari¬ 
ance S n | n _i (ui :n ) which can be computed using the Kalman filter. Similarly, given that the optimal IS 
distribution only depends on the path u\ :n through p(y n \ yi-.n-i,Ui :n ), it is sensible to build an importance 
distribution q(u n \ yi-. n , which only depends on u\- n throughp {y n \ yi :n -i, ui :n ). Hence in practice, we 

do not need to store {U[. n } but only {[/*_ 1:rl } and the Kalman filter statistics associated with The 

resulting particle filter is a bank of interacting Kalman filters where each Kalman filter is used to compute 
the marginal likelihood term p(yi :n \ ui-.n)- 


4.6.2 Partially observed linear Gaussian models 

The same idea can be applied to the class of partially observed linear Gaussian models [2]. Consider 
X =U x Z with Z = R"*. Here X n = ([/„, Z n ) with Z\ ~ Af (0, Si) 

Z n = AZ n _i + BV n , 

U„ = CZ n + DW n 

where V n vlS?’ Af (0,I nv ), W n 1 '~ 1 ' Af (0,I nv ,); see [2] for generalisations. We make the additional assumption 
that 

g{y n \x n ) = g{y n \ (u n ,z n )) = g{y n \u n ). 

In this case, we are interested in estimating 

P{uv.n,Zl-.n\yi-.n) = P ( «l:n| Vl-.n) P (#Ua\ Vl-.n, «1:») 

= p(ui:„\yi-.„)p(z 1:n \u 1:n ) 

where p(z\ :n \ U\ :n ) is a multivariate Gaussian distribution whose statistics can be computed using a Kalman 
filter associated with the linear model {U n ,Z n }. It follows that we only need to use particle methods to 
approximate 


In 5 = p Vl-.n) 

= P{ui:n)p(yi-.n\ui:„) 
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where p (yi :n \ ui-.n) : Y\ 9 {Vk\ Mfc) and p ( u\- n ) is the marginal Gaussian prior of {!/„} which corresponds 
fe=1 

to the ‘marginal’ likelihood term which can be computed using the Kalman filter associated with {U n ,Z n } . 
The resulting particle filter is also an interacting bank of Kalman filters but the Kalman filters are here used 
to compute the marginal prior p ( u \- n ). 


5 Particle Smoothing 


We have seen previously that SMC methods can provide an approximation of the sequence of distribu¬ 
tions {p(xi :n \ yi-.n)}- Consequently, sampling from a joint distribution p(x\-t\ yi-.r) and approximating the 
marginals {p(x n \ yi-.r)} for n = 1 ,T is straightforward. We just run an SMC algorithm up to time T and 
sample from/marginalise our SMC approximation p{x\-t \ yi-.r)- However, we have seen that this approach 
is bound to be inefficient when T is large as the successive resampling steps lead to particle degeneracy: 
p{ x \:n\V\:T) is approximated by a single unique particle for n C T. In this section, we discuss various 
alternative schemes which do not suffer from these problems. The first method relies on an simple fixed-lag 
approximation whereas the other algorithms rely on the forward-backward recursions presented in Section 
2.3. 


5.1 Fixed-lag Approximation 

The fixed-lag approximation is the simplest approach. It was proposed in [26]. It relies on the fact that, for 
hidden Markov models with “good” forgetting properties, we have 

p(xl:n\ yir) W P (a?i „| J/l:min(,i+A,T)) (47) 

for A large enough; that is observations collected at times k > n + Ado not bring any additional information 
about xi :n . This suggests a very simple scheme — simply don’t update the estimate at time k after time 
k = n + A. Indeed, in practice we just do not resample the components X\. n of the particles X\, k at times 
k > n + A. This algorithm is trivial to implement but the main practical problem is that we typically do 
not know A. Hence we need to replace A with an estimate of it denoted L. If we select L < A, then 
P {x\- n \ t/i : mm(?i+£,T)) is a poor approximation oip{x\- n \ y-\ -t)- If we select a large values of L to ensure that 
L > A then the degeneracy problem remains substantial. Unfortunately automatic selection of L is difficult 
(and, of course, for some poorly-mixing models A is so large that this approach is impractical). Experiments 
on various models have shown that good performance were achieved with L « 20 — 50. Note that such fixed- 
lag SMC schemes do not converge asymptotically (i.e. as N —>• oo) towards the true smoothing distributions 
because we do not have p(x i :n | yi-.r) = P (xi : „| yi-. m in(n+L,T))■ However, the bias might be negligible and 
can be upper bounded under mixing conditions [8]. It should also be noted that this method does not provide 
an approximation of the joint p(xi : T\yv.T) — it approximates only the marginal distributions. 


5.2 Forward Filtering-Backward Smoothing 

We have seen previously that it is possible to sample fromp {x\-.t\ Vi-.t) and compute the marginals {p(x n \ yi-.r)} 
using forward-backward formulae. It is possible to obtain an SMC approximation of the forward filtering- 
backward sampling procedure directly by noting that for 

1 v 

P(x n \yi:n) = ^2 {x n ) 
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we have 


p(x n \X n+1 ,y 1:n ) 


f (X n+1 \x n )p(x n \yi :n ) 
f f ( X n+ i | x n ) p ( x n | y 1:n ) dx r 
y W'J {X n+l \X' n ) 8xj (x n ) 

h E$LiWif(x n+ 1 \xi) 


Consequently, the following algorithm generates a sample approximately distributed according top (zutI Vi-.t)- 
first sample X T ~ p (xt\ Vi-.t) and for n = T — l,T — 2,1, sample X n ~ p (x n \ X n+ \,y 1:n ). 


Similarly, we can also provide an SMC approximation of the forward filtering-backward smoothing procedure 
by direct means. If we denote by 

N 

P{Xn\Vl:T) = W r,T 6 Xi, (•*») (48) 

the particle approximation of p(x n \yi : T) then, by inserting (48) into (15), we obtain 


P(x„\ 


) = Y, W n 


= Y, W n\T^{x n ). 


f(K + i|x») 

r [E^i WU (xi + 1 \xt n )] 


6 .X‘(*n) 


(49) 


The forward filtering-backward sampling approach requires O (NT) operations to sample one path approx¬ 
imately distributed according to p(x i : t | Vi-.t) whereas the forward filtering-backward smoothing algorithm 
requires O (N 2 T) operations to approximate {p(x n \ Vv.t)}- Consequently, these algorithms are only useful 
for very long time series in which sample degeneracy prevents computationally-cheaper, crude methods from 
working. 


5.3 Generalised Two-filter Formula 


To obtain an SMC approximation of the generalised two-filter formula (18), we need to approximate the 
backward filter {p(x n \y n: T)} and to combine the forward filter and the backward filter in a sensible way. To 
obtain an approximation of {p(x n \y n: T)} , we simply use an SMC algorithm which targets the sequence of 
distributions {p(x n: T\y n :T)} defined in (17). We run the SMC algorithm “backward in time” to approximate 
p(xt\Vt) then p(x T -i-t\Vt-i-.t) and so on. We obtain an SMC approximation denoted 

n _ 

p(Xn:T\Vn:T) =^ W h S X' n:T {*»:'/) • 


To combine the forward filter and the backward filter, we obviously cannot multiply the SMC approximations 
of both p(x n \y\ :n -{) and p(x n -T\y n :T) directly, so we first rewrite Eq. (18) as 

, , x J f (x„\x n _ 1 )p(x n _ 1 \y 1:n _ 1 )dx n _ 1 -p(x n \y n:T ) 

p(x ” |! ' I!T,k -AW-■ 

By plugging the SMC approximations of p (x n -\\yi-. n -i) and p(x n \y n -.T) in this equation, we obtain 


N 

p(x„\yi-. T ) = ^ W* t 8-^ (x n ) 
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where 


W °n 


,\T 


N 

W^YjWn-1 


Pn (Xi) 


(50) 


Like the SMC implementation of the forward-backward smoothing algorithm, this approach has a compu¬ 
tational complexity 0(N 2 T). However fast computational methods have been developed to address this 
problem [27]. Moreover it is possible to reduce this computational complexity to O(NT) by using rejection 
sampling to sample from p(x n \ yi-.r) using p(x n - 1| yi-. n -i) and p(x n \y n: T ) as proposal distributions. More 
recently, an importance sampling type approach has also been proposed in [19] to reduce the computational 
complexity to O (NT)-, see [6] for a related idea developed in the context of belief propagation. Compared to 
the forward-backward formula, it might be expected to substantially outperform that algorithm in any situ¬ 
ation in which the support of the smoothed estimate differs substantially from that of the filtering estimate. 
That is, in those situations in which observations obtained at time k > n provide a significant amount of 
information about the state at time n 6 . The improvement arises from the fact that the SMC implementation 
of the forward-backward smoother simply reweights a sample set which targets p(x\, n \y\ :n ) to account for 
the information provided by y n +i-.k whereas the two filter approach uses a different sample set with locations 
appropriate to the smoothing distributions. 


6 Summary 


We have provided a review of particle methods for filtering, marginal likelihood computation and smooth¬ 
ing. Having introduced a simple SMC algorithm, we have shown how essentially all of the particle-based 
methods introduced in the literature to solve these problems can be interpreted as a combination of two op¬ 
erations: sampling and resampling. By considering an appropriate sequence of target distributions, defined 
on appropriate spaces, it is possible to interpret all of these algorithms as particular cases of the general 
algorithm. 

This interpretation has two primary advantages: 


1. The standard algorithm may be viewed as a particle interpretation of a Feynman-Kac model (see 
[11]) and hence strong, sharp theoretical results can be applied to all algorithms within this common 
formulation. 

2. By considering all algorithms within the same framework is is possible to develop a common un¬ 
derstanding and intuition allowing meaningful comparisons to be drawn and sensible implementation 
decisions to be made. 


Although much progress has been made over the past fifteen years, and the algorithms described above 
provide good estimates in many complex, realistic settings, there remain a number of limitations and open 
problems: 


• As with any scheme for numerical integration, be it deterministic or stochastic, there exist problems 
which exist on sufficiently high-dimensional spaces and involving sufficiently complex distributions that 
it is not possible to obtain an accurate characterisation in a reasonable amount of time. 

®This situation occurs, for example, whenever observations are only weakly informative and the state evolution involves 
relatively little stochastic variability. An illustrative example provided in [5] shows that this technique can dramatically 
outperform all of the other approaches detailed above. 
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• In many settings of interest the likelihood and state transition density are only known up to a vector 
of unknown parameters. It is typically of interest to estimate this vector of parameters at the same 
time as (or in preference to) the vector of states. Formally, such situations can be described as directly 
as a state space model with a degenerate transition kernel. However, this degeneracy prevents the 
algorithms described above from working in practice. 

• Several algorithms intended specifically for parameter estimation have been developed in recent years. 
Unfortunately, space constraints prevent us from discussing these methods within the current tutorial. 
Although progress has been made, this is a difficult problem and it cannot be considered to have been 
solved in full generality. These issues will be discussed in further depth in an extended version of this 
tutorial which is currently in preparation. 


It should also be mentioned that the SMC algorithm presented in this tutorial can be adapted to perform 
much more general Monte Carlo simulation: it is not restricted to problems of the filtering type, or even 
to problems with a sequential character. It has recently been established that it is also possible to employ 
SMC within MCMC and to obtain a variety of related algorithms. 
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